We address the problem of unsupervised domain adaptation when the source domain differs from the target domain because of a shift in the distribution of a latent subgroup. When this subgroup confounds all observed data, neither covariate shift nor label shift assumptions apply. We show that the optimal target predictor can be non-parametrically identified with the help of concept and proxy variables available only in the source domain, and unlabeled data from the target. The identification results are constructive, immediately suggesting an algorithm for estimating the optimal predictor in the target. For continuous observations, when this algorithm becomes impractical, we propose a latent variable model specific to the data generation process at hand. We show how the approach degrades as the size of the shift changes, and verify that it outperforms both covariate and label shift adjustment.
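Causal machine learning (CausalML) is an umbrella term for machine learning methods that formalize the data-generating process as a structural causal model (SCM). This makes it possible to reason about the effects of changes to this process (i.e., interventions) and about what would have happened in hindsight (i.e., counterfactuals). We divide the work into five groups according to the problems they address: (1) causal supervised learning, (2) causal generative modeling, (3) causal explanations, (4) causal fairness, and (5) causal reinforcement learning. For each category, we systematically compare its methods and point out open problems. In addition, we review modality-specific applications in computer vision, natural language processing, and graph representation learning. Finally, we provide an overview of causal benchmarks and a critical discussion of the state of this nascent field, including recommendations for future work.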
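We propose a kernel-based nonparametric estimator of the causal effect when the cause is corrupted by error. We do so by generalizing estimation in the instrumental variable setting. Despite significant work on regression with measurement error, additionally handling unobserved confounding in the continuous setting is non-trivial: we have seen little prior work. As a by-product of our investigation, we clarify a connection between mean embeddings and characteristic functions, and how learning one simultaneously allows one to learn the other. This opens the way for kernel method research to leverage existing results in characteristic function estimation. Finally, we show empirically that our proposed method, MEKIV, improves over baselines and is robust under changes in the strength of the measurement error and in the type of error distribution.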
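As a point of reference for the instrumental variable setting mentioned in this abstract, the following is a minimal sketch of classical linear two-stage least squares on synthetic data in which the observed cause is both confounded and measured with error. It is not MEKIV and involves no kernels. The data-generating process, the coefficient values, and the function names `simulate`, `two_stage_least_squares`, and `ols_slope` are assumptions made purely for illustration.

```python
import numpy as np


def simulate(n=20000, seed=0):
    """Synthetic data (assumed for illustration): true causal effect is 2.0."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                      # unobserved confounder
    z = rng.normal(size=n)                      # instrument, independent of u
    x_true = z + u                              # true cause
    x_obs = x_true + 0.5 * rng.normal(size=n)   # cause observed with error
    y = 2.0 * x_true + 3.0 * u + 0.1 * rng.normal(size=n)
    return z, x_obs, y


def ols_slope(x, y):
    """Ordinary least squares slope of y on x (with intercept)."""
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]


def two_stage_least_squares(z, x, y):
    """Linear 2SLS with a single instrument z and a single treatment x."""
    # Stage 1: project the (noisy, confounded) treatment onto the instrument.
    z_design = np.column_stack([np.ones_like(z), z])
    gamma, *_ = np.linalg.lstsq(z_design, x, rcond=None)
    x_hat = z_design @ gamma
    # Stage 2: regress the outcome on the projected treatment.
    return ols_slope(x_hat, y)
```

On data from `simulate()`, `ols_slope(x_obs, y)` is biased away from 2.0 by the confounder and the measurement error, whereas `two_stage_least_squares(z, x_obs, y)` should land close to 2.0 because the instrument is independent of both.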
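Graph self-supervised learning (GSSL) paves the way for learning graph embeddings without expert annotation, which is particularly impactful for molecular graphs, since the number of possible molecules is enormous and labels are expensive to obtain. However, by design, GSSL methods are not trained to perform well on one downstream task but aim for transferability to many, which makes evaluating them less straightforward. As a step toward obtaining profiles of molecular graph embeddings with diverse and interpretable attributes, we introduce Molecular Graph Representation Evaluation (MolGraphEval), a suite of probe tasks grouped into (i) topological, (ii) substructure, and (iii) embedding space properties. By benchmarking existing GSSL methods on both existing downstream datasets and MolGraphEval, we find surprising discrepancies between conclusions drawn from existing datasets alone and those from more fine-grained probing, suggesting that current evaluation protocols do not provide the whole picture. Our modular, automated end-to-end GSSL pipeline code will be released upon acceptance, including standardized graph loading, experiment management, and embedding evaluation.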
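Causal effect estimation is important for many tasks in the natural and social sciences. However, effects cannot be identified from observational data without making strong, often untestable assumptions. We consider algorithms for the partial identification problem: bounding the effects of multivariate, continuous treatments when unmeasured confounding makes identification impossible. We consider a framework in which observable evidence is matched to the implications of constraints encoded in a causal model by norm-based criteria. This generalizes classical approaches based purely on generative models. Casting causal effects as objective functions in a constrained optimization problem, we combine flexible learning algorithms with Monte Carlo methods to implement a family of solutions under the name of stochastic causal programming. In particular, we present ways in which such constrained optimization problems can be parameterized without likelihood functions for the causal or the observed data model, reducing the computational and statistical complexity of the task.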
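We address the problem of causal effect estimation in the presence of unobserved confounding, where proxies for the latent confounder are observed. For this setting we propose two kernel-based methods for nonlinear causal effect estimation: (a) a two-stage regression approach, and (b) a maximum moment restriction approach. We focus on the proximal causal learning setting, but our methods can be used to solve a wider class of inverse problems characterized by a Fredholm integral equation. In particular, we provide a unifying view of two-stage and moment restriction approaches for solving this problem in a nonlinear setting. We provide consistency guarantees for each algorithm, and demonstrate that these approaches achieve competitive results on synthetic data and on data simulating a real-world task. In particular, our approach outperforms earlier methods that are not suited to leveraging proxy variables.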
Machine learning can impact people with legal or ethical consequences when it is used to automate decisions in areas such as insurance, lending, hiring, and predictive policing. In many of these scenarios, previous decisions have been made that are unfairly biased against certain subpopulations, for example those of a particular race, gender, or sexual orientation. Since this past data may be biased, machine learning predictors must account for this to avoid perpetuating or creating discriminatory practices. In this paper, we develop a framework for modeling fairness using tools from causal inference. Our definition of counterfactual fairness captures the intuition that a decision is fair towards an individual if it is the same in (a) the actual world and (b) a counterfactual world where the individual belonged to a different demographic group. We demonstrate our framework on a real-world problem of fair prediction of success in law school. * Equal contribution. This work was done while JL was a Research Fellow at the Alan Turing Institute. 31st Conference on Neural Information Processing Systems (NIPS 2017).
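The intuition behind counterfactual fairness can be written as a single condition. The notation here is not given in the abstract and is introduced as an assumption: $A$ denotes the protected attributes, $X$ the remaining observed features, $U$ the latent background variables of the causal model, and $\hat{Y}_{A \leftarrow a}(U)$ the prediction obtained when $A$ is set to $a$ by intervention. Under that notation, a predictor $\hat{Y}$ is counterfactually fair if, for every context $X = x$, $A = a$, every outcome $y$, and every attainable value $a'$ of $A$:

```latex
P\left(\hat{Y}_{A \leftarrow a}(U) = y \mid X = x, A = a\right)
  = P\left(\hat{Y}_{A \leftarrow a'}(U) = y \mid X = x, A = a\right)
```

The left-hand side corresponds to the actual world (a), and the right-hand side to the counterfactual world (b) in which the same individual belongs to a different demographic group.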
Remote sensing imagery provides comprehensive views of the Earth, where different sensors collect complementary data at different spatial scales. Large, pretrained models are commonly finetuned with imagery that is heavily augmented to mimic different conditions and scales, with the resulting models used for various tasks with imagery from a range of spatial scales. Such models overlook scale-specific information in the data. In this paper, we present Scale-MAE, a pretraining method that explicitly learns relationships between data at different, known scales throughout the pretraining process. Scale-MAE pretrains a network by masking an input image at a known input scale, where the area of the Earth covered by the image, not the image resolution, determines the scale of the ViT positional encoding. Scale-MAE encodes the masked image with a standard ViT backbone, and then decodes the masked image through a bandpass filter to reconstruct low/high frequency images at lower/higher scales. We find that tasking the network with reconstructing both low- and high-frequency images leads to robust multiscale representations for remote sensing imagery. Scale-MAE achieves an average $5.0\%$ non-parametric kNN classification improvement across eight remote sensing datasets compared to the current state of the art, and obtains a $0.9$ mIoU to $3.8$ mIoU improvement on the SpaceNet building segmentation transfer task for a range of evaluation scales.
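One way to picture a positional encoding whose scale depends on ground coverage rather than pixel resolution is to rescale each position by the ground sample distance (GSD) before applying the usual sinusoidal encoding. The sketch below is an illustration under stated assumptions, not the Scale-MAE implementation: the function name `gsd_positional_encoding`, the 1D layout, the reference GSD of 1.0, and the base of 10000 are all hypothetical choices.

```python
import numpy as np


def gsd_positional_encoding(num_positions, dim, gsd, reference_gsd=1.0):
    """1D sinusoidal encoding with positions rescaled by gsd / reference_gsd.

    gsd is the ground distance covered by one position step (assumed units:
    metres); reference_gsd is a hypothetical normalisation constant.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    # Convert pixel/patch indices into ground-distance units.
    positions = np.arange(num_positions, dtype=np.float64)[:, None]
    positions = positions * (gsd / reference_gsd)
    # Standard transformer frequency schedule.
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2, dtype=np.float64) / dim))
    angles = positions * freqs[None, :]
    encoding = np.empty((num_positions, dim), dtype=np.float64)
    encoding[:, 0::2] = np.sin(angles)   # even channels
    encoding[:, 1::2] = np.cos(angles)   # odd channels
    return encoding
```

With this construction, index 2 at a GSD of 1.0 and index 1 at a GSD of 2.0 receive the same encoding, because both lie two ground units from the origin. That is the sense in which the encoding follows the area covered rather than the pixel grid.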
With the rise in high-resolution remote sensing technologies, there has been an explosion in the amount of data available for forest monitoring, and an accompanying growth in artificial intelligence applications to automatically derive forest properties of interest from these datasets. Many studies use their own data at small spatio-temporal scales, and demonstrate an application of an existing or adapted data science method for a particular task. This approach often involves intensive and time-consuming data collection and processing, but generates results restricted to specific ecosystems and sensor types. There is a lack of widespread acknowledgement of how the types and structures of the data used affect the performance and accuracy of analysis algorithms. To accelerate progress in the field more efficiently, benchmarking datasets upon which methods can be tested and compared are sorely needed. Here, we discuss how lack of standardisation impacts confidence in the estimation of key forest properties, and how considerations of data collection need to be accounted for in assessing method performance. We present pragmatic requirements and considerations for the creation of rigorous, useful benchmarking datasets for forest monitoring applications, and discuss how tools from modern data science can improve the use of existing data. We list a set of example large-scale datasets that could contribute to benchmarking, and present a vision for how community-driven, representative benchmarking initiatives could benefit the field.
Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present-day 3D repositories in terms of scale, number of categories, and the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.